Skip to content

Conversation

@emkornfield
Copy link
Collaborator

@emkornfield emkornfield commented Sep 10, 2025

🥞 Stacked PR

Use this link to review incremental changes.


This PR adds a with_data_change method to apply whether a transaction represents a data change to all the file additions (and in the future removals). This helps prevents clients from making accidentally mixing files marked with dataChange = true or false, and is in the long term the direction we want to go in.

In the future we will also like want to physically record dataChange at the commit level, and move away from per file denotation.

Some implications of this change:

  • The schema of add_files is now dependent on whether this flag is sent (this is to allow connectors to maintain backwards compatibility).
  • Change the default engine to no longer pass through dataChange at the file level but instead require using the new setter.

Testing:

  1. Existing FFI tests verify backwards compatibility.
  2. Write tests now call this method with the default engine and no change of output happens.
  3. An additional test is added to verify setting the flag to false, writes the appropriate value on add actions.

BREAKING CHANGE: Engines must use with_data_change on the transaction level instead of passing it to the method. add_files_schema is moved to be scoped on a the transaction.

@emkornfield emkornfield changed the title wip feat!: Add with_data_change to transaction Sep 10, 2025
@github-actions github-actions bot added the breaking-change Change that require a major version bump label Sep 10, 2025
@emkornfield emkornfield marked this pull request as draft September 11, 2025 18:11
@emkornfield
Copy link
Collaborator Author

Need to look into CI failures.

@codecov
Copy link

codecov bot commented Sep 12, 2025

Codecov Report

❌ Patch coverage is 95.37037% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.96%. Comparing base (1f21e4b) to head (a9159f0).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
ffi/src/transaction/mod.rs 91.30% 0 Missing and 2 partials ⚠️
kernel/src/engine/default/parquet.rs 87.50% 0 Missing and 1 partial ⚠️
kernel/src/row_tracking.rs 96.15% 0 Missing and 1 partial ⚠️
kernel/src/transaction/mod.rs 98.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1281      +/-   ##
==========================================
- Coverage   84.96%   84.96%   -0.01%     
==========================================
  Files         114      114              
  Lines       29439    29445       +6     
  Branches    29439    29445       +6     
==========================================
+ Hits        25012    25017       +5     
  Misses       3220     3220              
- Partials     1207     1208       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@emkornfield emkornfield force-pushed the stack/with_data_change branch from 553b09b to 14e87e0 Compare September 13, 2025 01:13
@emkornfield emkornfield marked this pull request as ready for review September 13, 2025 05:14
Copy link
Collaborator

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally good. I'm not sure we need to worry about maintaining backwards compatibility though. Let's just break things :)

// Explicitly define the schema here which duplicates Transaction::add_files_schema, to demonstrate
// engines can still specify data change explicitly (i.e. preserve backwards compatibility for
// non-default engines).
let mut schema_fields = vec![
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just add a static method on Transaction that can return the right schema depending on if you want data change or not, and call that, since you have both correct schemas statically in txn

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to decouple the test from add_schema code in general. I think if we are going to make more modifications to the code maintaining the static would potentially add complexity for test only code. If this is a common pattern in rust, happy to adapt it.

This is also only necessary because we want to test backwards compatibility, if we don't care about it it would be easier to cleanup.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after discussing offline, i've removed backwads compatibility and now pass this schema through from the transaction add_files_schema

@emkornfield
Copy link
Collaborator Author

@nicklan thanks for the review, the code is now simplified to not be backwards compatible. Is the windows failure a known issue (on the surface seems like it might be unrelated to this PR but could be missing something)

@OussamaSaoudi OussamaSaoudi removed the request for review from scovich September 29, 2025 22:41
@emkornfield emkornfield force-pushed the stack/with_data_change branch from b8bbfa5 to 48d49e6 Compare September 29, 2025 22:46
Copy link
Collaborator

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, thanks for simplifying! one more suggestion on the ffi

///
/// # Safety
///
/// Caller is responsible for passing a valid handle. CONSUMES TRANSACTION and commit info
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These apis that consume the argument are inherently more dangerous because there's no compiler on the other side enforcing things like there in pure rust.

Could we just as_ref_mut() it instead, and set the bool on the txn? I think that might require a new method in handle.rs but it should be pretty trivial to write that (and if it's not stop and we can do it this way)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will take a look at doing this. I think the current underlying function consumes the input, so I might need to adjust there as well.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a a new set_data_change method on Transaction and called it here (renaming this method as well) to avoid consuming the value.

Please let me know if this is what you had in mind. It seems at least at the surface as_mut on handle works here but I guess there might be subtleties

@emkornfield emkornfield requested a review from nicklan September 30, 2025 05:14
@zachschuermann zachschuermann requested a review from scovich October 1, 2025 05:12
Copy link
Member

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good will finalize tomorrow - assuming we want data_change to be a txn-wide const (not per-file as it was before) cc @nicklan @scovich

Comment on lines 331 to 338
unsafe {
txn_with_engine_info
.shallow_copy()
.as_ref()
.add_files_schema()
}
.as_ref()
.try_into_arrow()?,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: preference for pulling this out to a separate binding instead of inline deep nesting

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Comment on lines 189 to 190
/// Note that the schema does not contain the dataChange column. Users of the default engine
/// must add this column by calling [`crate::transaction::Transaction::with_data_change`].
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Note that the schema does not contain the dataChange column. Users of the default engine
/// must add this column by calling [`crate::transaction::Transaction::with_data_change`].
/// Note that the schema does not contain the dataChange column. In order to set `data_change` flag, use [`crate::transaction::Transaction::with_data_change`].

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Comment on lines 265 to 284
/// Same as [`Transaction::with_data_change`] but set the value directly instead of
/// using a fluent API.
pub fn set_data_change(&mut self, data_change: bool) {
self.data_change = data_change;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm if this is mostly used for FFI how about just make this internal_api so we don't have many ways to do the same thing? (not a super strong preference, just hoping to simplify user API)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

/// Concretely, it is the expected schema for [`EngineData`] passed to [`add_files`], as it is the base
/// for constructing an add_file. Each row represents metadata about a
/// file to be added to the table. Kernel takes this information and extends it to the full add_file
/// action schema, adding additional fields (e.g., baseRowID) as necessary.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// action schema, adding additional fields (e.g., baseRowID) as necessary.
/// action schema, adding internal fields (e.g., baseRowID) as necessary.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

///
/// Note: While currently static, in the future the schema might change depending on
/// options set on the transaction or features enabled on the table.
///
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome, thank you so much for the docs improvements!!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh haha realizing this is code movement :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(but still appreciate the other docs above etc!)

Copy link
Collaborator

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, lgtm assuming you address zach's comments!

@emkornfield emkornfield force-pushed the stack/with_data_change branch from 6da6011 to f431f59 Compare October 15, 2025 18:38
Copy link
Member

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good, sorry forgot to post some of these comments!

///
/// Note: While currently static, in the future the schema might change depending on
/// options set on the transaction or features enabled on the table.
///
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh haha realizing this is code movement :)

///
/// Note: While currently static, in the future the schema might change depending on
/// options set on the transaction or features enabled on the table.
///
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(but still appreciate the other docs above etc!)

@emkornfield emkornfield merged commit 87e3552 into delta-io:main Oct 15, 2025
22 checks passed
samansmink pushed a commit to samansmink/delta-kernel-rs that referenced this pull request Oct 19, 2025
## 🥞 Stacked PR
Use this
[link](https://github.com/delta-io/delta-kernel-rs/pull/1281/files) to
review incremental changes.
-
[**stack/with_data_change**](delta-io#1281)
[[Files
changed](https://github.com/delta-io/delta-kernel-rs/pull/1281/files)]
-
[stack/add_stats_to_remove](delta-io#1390)
[[Files
changed](https://github.com/delta-io/delta-kernel-rs/pull/1390/files/a9159f03f6260388d7bbf9f4fd2ca06f8db21f8f..7637fb90686733d81815f4fae5f9e47880f55a4f)]
-
[stack/remove_files](delta-io#1353)
[[Files
changed](https://github.com/delta-io/delta-kernel-rs/pull/1353/files/7637fb90686733d81815f4fae5f9e47880f55a4f..94e2108cbea608b98284c634807b4f265cddfa29)]

---------
This PR adds a with_data_change method to apply whether a transaction
represents a data change to all the file additions (and in the future
removals). This helps prevents clients from making accidentally mixing
files marked with dataChange = true or false, and is in the long term
the direction we want to go in.

In the future we will also like want to physically record dataChange at
the commit level, and move away from per file denotation.

Some implications of this change:

* The schema of add_files is now dependent on whether this flag is sent
(this is to allow connectors to maintain backwards compatibility).
* Change the default engine to no longer pass through dataChange at the
file level but instead require using the new setter.

Testing:
1.  Existing FFI tests verify backwards compatibility.
2. Write tests now call this method with the default engine and no
change of output happens.
3. An additional test is added to verify setting the flag to false,
writes the appropriate value on add actions.

BREAKING CHANGE: Engines must use with_data_change on the transaction
level instead of passing it to the method. `add_files_schema` is moved
to be scoped on a the transaction.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

breaking-change Change that require a major version bump

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants